Critical behavior in a cross-situational lexicon learning scenario 
The associationist account for early word-learning is based on the co-occurrence between objects and words. Here we examine the performance of a simple associative learning algorithm for acquiring the referents of words in a cross-situational scenario affected by noise produced by out-of-context words. We find a critical value of the noise parameter $\gamma_c$ above which learning is impossible. We use finite-size scaling to show that the sharpness of the transition persists across a region of order $\tau^{-1/2}$ about $\gamma_c$, where $\tau$ is the number of learning trials, as well as to obtain the learning error (scaling function) in the critical region. In addition, we show that the distribution of durations of periods when the learning error is zero is a power law with exponent $-3/2$ at the critical point.



I. INTRODUCTION 

The problem of early word learning has been the subject of philosophical controversy for centuries [1]. The always visionary Augustine argued that the child makes the connections between words and their referents by understanding the referential intentions of others, thus anticipating the modern theory of mind by about fifteen centuries [2]. In the 17th century, Locke's empiricism supported the associationist viewpoint, which contends that the mechanism of word learning is sensitivity to covariation, i.e., if two events occur at the same time, they become associated.

Here we examine a radical offshoot of the associationist approach to lexicon acquisition termed cross-situational or observational learning [3], which asserts that the meaning of a word can be determined by looking for something in common across all observed uses of that word [4]. In other words, learning takes place through the statistical sampling of the contexts in which a word appears.

A scenario to describe the lexicon acquisition process should take into account the inherent ambiguity of the learning task (i.e., many distinct objects may be associated with the same word) as well as the noisy effect of out-of-context words (i.e., the uttered word may not refer to any object in the context). Whereas the noiseless scenario has been explored in great detail in the literature [5-7], where it was shown that the learning error decreases exponentially with the number of learning trials, a systematic study of the effect of noise is lacking.

To remedy this deficiency, we modify the minimal model of noiseless cross-situational learning [5-7] so as to include the effect of noise produced by out-of-context words. Using Monte Carlo simulations and finite-size scaling we identify and characterize a critical phenomenon that separates the asymptotic regime where the lexicon can be acquired without errors from the regime where learning is impossible. At the critical noise level, we find that the duration of the periods with zero error follows a power-law distribution.



II. CROSS-SITUATIONAL LEARNING SCENARIO



We assume that there are N objects, N words and a 
one-to-one mapping between words and objects. At each 
learning event, C objects are chosen at random without 
replacement from the fixed list of N objects and one of 
these objects is named according to the word-object map- 
ping. The C objects form the context which determines 
the interpretation of the uttered word and the learner's 
task is to guess which of the C objects that word refers 
to. This is then an ambiguous word-learning scenario in 
which there are multiple object candidates for any word. 
The parameter C is a measure of the ambiguity of the 
learning task. In particular, in the case C = N the word- 
object mapping is not learnable within a cross-situational 
scenario. 

A learning episode comprises a context and a single target word. In an uncorrupted learning episode, the context must exhibit the correct object (i.e., the object named by the target word according to the object-word mapping) plus $C - 1$ distinct mismatching objects. Noise is added to the learning scenario by removing the correct object from the context, which will then exhibit $C$ mismatching objects. Such corrupted and misguiding learning episodes occur with probability $\gamma \in [0, 1]$. This type of noise is an integral part of any realistic learning situation, arising usually from the unwarranted narrowing of the context by the learner.
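The construction of a learning episode can be sketched in a few lines of Python. This is only an illustration of the scenario described above, not the authors' simulation code: the function name `sample_episode`, its arguments and the use of Python's `random` module are assumptions introduced here, and a corrupted episode requires $C \leq N - 1$.

```python
import random

def sample_episode(N, C, gamma, target, rng):
    """Return the context (a list of C distinct object labels) for one trial.

    Objects are labelled 1..N and `target` is the object named by the word.
    With probability `gamma` the episode is corrupted: the target object is
    absent and the context holds C mismatching objects (needs C <= N - 1).
    """
    others = [i for i in range(1, N + 1) if i != target]
    if rng.random() < gamma:
        # corrupted episode: C mismatching objects, correct object removed
        return rng.sample(others, C)
    # uncorrupted episode: correct object plus C - 1 distinct mismatching ones
    return [target] + rng.sample(others, C - 1)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
context = sample_episode(N=20, C=6, gamma=0.7, target=1, rng=rng)
```

Sampling without replacement via `rng.sample` mirrors the statement that the objects in a context are distinct.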

To represent the one-to-one object-word mapping we use the index $i = 1, \ldots, N$ to label the distinct objects and $h = 1, \ldots, N$ to label the distinct words. Then, without loss of generality, the correct mapping is defined by assigning object $i = 1$ to word $h = 1$, object $i = 2$ to word $h = 2$ and so on. The problem faced by the learner is to determine the correct mapping given a sequence of learning episodes. Next we will describe a simple (perhaps the simplest) procedure to accomplish this learning task.






III. ASSOCIATIVE LEARNING MODEL 

We assume that learning is a change in the confidence with which the learner associates the target word $h$ to a given object $i$ and represent this confidence by a non-negative integer $p_{ih}$. Our associative accumulator learning procedure is described as follows. Before learning all confidences are set to zero, i.e., $p_{ih} = 0$ for $i, h = 1, \ldots, N$, and whenever object $i^*$ appears in a context with target word $h^*$ the confidence $p_{i^* h^*}$ increases by one unit [8]. Hence, exactly $C$ confidence values are updated at each learning trial.

To determine which object corresponds to word $h$ the learner simply chooses the object index $i$ for which $p_{ih}$ is maximum. In the case of ties, the learner selects one object at random among those that maximize the confidence. From the definition of the correct word-object mapping, our learning algorithm achieves a perfect performance when $p_{hh} > p_{ih}$ for all $h$ and $i \neq h$.
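A minimal sketch of the accumulator update and of the learner's guess, under the assumption that the confidences are stored in a nested list `p[i][h]` with labels running from 1 to N (index 0 unused). The names `update` and `guess` are hypothetical and serve only to illustrate the two rules stated above.

```python
import random

def update(p, context, h):
    """Add one unit of confidence to p[i][h] for every object i in the context."""
    for i in context:
        p[i][h] += 1

def guess(p, h, N, rng):
    """Return an object with maximal confidence for word h; ties broken at random."""
    best = max(p[i][h] for i in range(1, N + 1))
    candidates = [i for i in range(1, N + 1) if p[i][h] == best]
    return rng.choice(candidates)

N = 5
# p[i][h] with labels 1..N (row and column 0 are unused padding)
p = [[0] * (N + 1) for _ in range(N + 1)]
update(p, context=[2, 4], h=2)  # word 2 uttered with objects 2 and 4 in view
update(p, context=[2, 5], h=2)  # word 2 uttered with objects 2 and 5 in view
choice = guess(p, h=2, N=N, rng=random.Random(0))  # object 2 is the unique maximum
```

After these two trials object 2 has confidence 2 for word 2 while objects 4 and 5 have confidence 1, so the guess is unambiguous; with a single trial the learner would have to pick at random between the two objects in view.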

A critical feature of the accumulator model is that words are learned independently. This fact alone allows us to split the analysis of the vocabulary learning task into two parts. The first and most important part is the problem of learning the meaning (or the referent) of a single word. Once this is done, we can easily solve the problem of learning the $N$ words given their sampling frequencies [7]. Hence, in this work we will focus on the single-word learning problem only.

IV. SINGLE-WORD LEARNING

Accordingly, we consider the learning of a single word, say word $h$, which is then uttered at all learning trials $\tau$. We define the single-word learning error $\epsilon(\tau)$ for $\tau > 0$ as follows. If $p_{hh} < p_{ih}$ for any $i \neq h$ then $\epsilon = 1$; otherwise, if $p_{hh} = p_{ih}$ for $n$ values of $i \neq h$ then $\epsilon = n/(n + 1)$ with $n = 0, \ldots, N - 1$. At $\tau = 0$ all confidences are set to zero and so $\epsilon = (N - 1)/N$.
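The definition of the single-word error translates directly into code. The sketch below assumes the confidences for the target word are held in a plain list `conf` indexed from 0, with `h` the position of the correct object; the function name `learning_error` and the use of exact fractions are choices made here for illustration.

```python
from fractions import Fraction

def learning_error(conf, h):
    """Single-word learning error for confidences conf[i] and correct object h.

    Returns 1 if some other object has strictly larger confidence, and
    n / (n + 1) if exactly n other objects tie with the correct one.
    """
    rivals = [c for i, c in enumerate(conf) if i != h]
    if any(c > conf[h] for c in rivals):
        return Fraction(1)
    n = sum(1 for c in rivals if c == conf[h])
    return Fraction(n, n + 1)

# Before learning (all confidences zero) with N = 5 objects: error (N - 1)/N
initial_error = learning_error([0, 0, 0, 0, 0], h=0)
```

The value $n/(n+1)$ is simply the probability that a uniformly random choice among the $n + 1$ tied objects misses the correct one.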

In the noiseless case ($\gamma = 0$) we have $p_{hh} \geq p_{ih}$ for all $i \neq h$ since object $i = h$ is always part of the context. So errors are due to ties $p_{hh} = p_{ih}$, $i \neq h$, only. In fact, it can be shown analytically that in this case the average learning error vanishes like $[(C - 1)/(N - 1)]^{\tau}$ for large $\tau$ [5-7]. As expected, for $C = 1$ we have $\epsilon = 0$ at the first learning trial $\tau = 1$ already but, more interestingly, learning becomes faster with increasing $N$. This apparently counterintuitive result has a simple explanation: a large list of objects to select from actually decreases the odds of choosing the same confounding object during the learning trials, thus reducing the number of ties. However, this decrease is overcompensated by the sampling effect when we consider the problem of learning the entire vocabulary, and then learning slows down as $N$ increases, as expected [7].

When the contexts are corrupted by noise with probability $\gamma$, an analytical approach is not possible and we have to resort to simulations to study the stochastic learning process. Figure 1 shows a typical evolution of the learning error at the critical noise level. Although this figure reveals a rich stochastic dynamics, it is rather uninformative from the learning perspective. In that sense, the behavior of the average learning error $\langle \epsilon \rangle$, shown in Fig. 2, is more relevant. For a fixed $\tau$, this average is calculated using typically $10^6$ to $10^7$ realizations of the learning process.

FIG. 1: Learning error against the number of learning trials $\tau$ for a single sample of the learning process using the accumulator learning model. The parameters are $N = 20$, $C = 6$ and $\gamma = \gamma_c = 0.7$. The lines are guides to the eye.

FIG. 2: Average learning error $\langle \epsilon \rangle$ as a function of the number of learning trials for $N = 5$, $C = 2$ and (bottom to top) $\gamma = 0, 0.1, 0.2, \ldots, 0.9$. The critical value of the noise parameter is $\gamma_c = 0.6$, at which $\langle \epsilon_c \rangle = 0.8$. The symbols are the simulation results and the lines are guides to the eye.

Figure 2 reveals that learning is possible provided that the noise parameter does not exceed a certain threshold $\gamma_c$. More pointedly, in the asymptotic regime $\tau \to \infty$ we find that $\langle \epsilon \rangle \to 0$ for $\gamma < \gamma_c$ and that $\langle \epsilon \rangle \to 1$ for $\gamma > \gamma_c$. The surprising finding is that at $\gamma = \gamma_c$, the average learning error becomes independent of $\tau > 0$.
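The Monte Carlo estimate of the average single-word error can be sketched as follows. This is a small, self-contained illustration rather than the code used for the figures: the function `average_error`, the zero-based labelling with object 0 as the correct referent, the seed and the modest sample size are all assumptions made here.

```python
import random

def average_error(N, C, gamma, tau, samples, seed=0):
    """Estimate the average single-word learning error after tau trials."""
    rng = random.Random(seed)
    others = list(range(1, N))  # mismatching objects; object 0 is the referent
    total = 0.0
    for _ in range(samples):
        conf = [0] * N
        for _ in range(tau):
            if rng.random() < gamma:
                context = rng.sample(others, C)            # corrupted episode
            else:
                context = [0] + rng.sample(others, C - 1)  # uncorrupted episode
            for i in context:
                conf[i] += 1
        top = max(conf[1:])
        if top > conf[0]:
            total += 1.0                 # some wrong object is strictly ahead
        elif top == conf[0]:
            n = conf[1:].count(top)      # number of wrong objects tied with 0
            total += n / (n + 1)
        # otherwise the correct object is strictly ahead and the error is 0
    return total / samples

# N = 5, C = 2 at the noise level quoted as critical in Fig. 2
for tau in (1, 10, 100):
    print(tau, average_error(N=5, C=2, gamma=0.6, tau=tau, samples=2000))
```

If the observation in the text holds, the printed averages should all lie near 0.8 up to sampling noise; far more realizations than the 2000 used here are needed to resolve the curves shown in Fig. 2.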

There is a simple argument to determine $\gamma_c$ as well as the error $\langle \epsilon_c \rangle$ at this critical noise parameter. First, we note that the borderline between learning and non-learning occurs when all $N$ objects are equally likely to be selected to compose the contexts. We recall that this is exactly the situation prior to learning and so we expect that

$$\langle \epsilon_c \rangle = \epsilon(\tau = 0) = \frac{N - 1}{N}. \qquad (1)$$



Accordingly, $\gamma_c$ is determined by equating the probability of selecting the correct object with the probability of selecting any given incorrect object to compose the context in a learning episode, i.e.,

$$1 - \gamma_c = (1 - \gamma_c) \frac{C - 1}{N - 1} + \gamma_c \frac{C}{N - 1}, \qquad (2)$$

from which we get

$$\gamma_c = 1 - \frac{C}{N}. \qquad (3)$$
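For completeness, the algebra connecting the balance condition to the closed form for $\gamma_c$ can be spelled out. The steps below assume the balance condition reads $1 - \gamma_c = (1 - \gamma_c)(C - 1)/(N - 1) + \gamma_c C/(N - 1)$, i.e., a given incorrect object is one of the $C - 1$ confounders of an uncorrupted episode or one of the $C$ confounders of a corrupted one; multiplying through by $N - 1$ and collecting terms gives:

```latex
\begin{align*}
(1 - \gamma_c)(N - 1) &= (1 - \gamma_c)(C - 1) + \gamma_c C \\
(1 - \gamma_c)(N - C) &= \gamma_c C \\
N - C &= \gamma_c N
  \quad\Longrightarrow\quad \gamma_c = 1 - \frac{C}{N}.
\end{align*}
```

As a quick numerical instance, $N = 5$ and $C = 2$ give $\gamma_c = 0.6$, and $N = 20$ and $C = 6$ give $\gamma_c = 0.7$, the values quoted in the captions of Figs. 2 and 1.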



These neat expressions for $\langle \epsilon_c \rangle$ and $\gamma_c$ proved correct for a vast selection of values of $N$ and $C$, but we have no mathematical proof of their validity beyond the argument presented above. However, we can perform a simple consistency check on these expressions as follows. The average learning error at the first trial is given by

$$\langle \epsilon(\tau = 1) \rangle = (1 - \gamma) \frac{C - 1}{C} + \gamma \qquad (4)$$

and by setting $\gamma = \gamma_c$ we recover Eq. (1), as it should be since $\langle \epsilon_c \rangle$ is independent of $\tau$ (see Fig. 2).
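The consistency check can be verified with exact rational arithmetic. The sketch below assumes the first-trial error has the form $(1 - \gamma)(C - 1)/C + \gamma$ (an uncorrupted first trial leaves $C$ objects tied, a corrupted one puts a wrong object ahead) and that $\gamma_c = 1 - C/N$; the function names are hypothetical.

```python
from fractions import Fraction

def critical_noise(N, C):
    """Critical noise parameter, gamma_c = 1 - C/N."""
    return 1 - Fraction(C, N)

def first_trial_error(C, gamma):
    """Average error after one trial: (1 - gamma)(C - 1)/C + gamma."""
    return (1 - gamma) * Fraction(C - 1, C) + gamma

# At gamma = gamma_c the first-trial error should equal (N - 1)/N exactly
for N in range(2, 30):
    for C in range(1, N):
        assert first_trial_error(C, critical_noise(N, C)) == Fraction(N - 1, N)
```

The loop passes silently for every context size $1 \leq C < N$ in the range tested, which is just the identity $(C/N)(C - 1)/C + 1 - C/N = 1 - 1/N$ checked case by case.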



V. FINITE-SIZE SCALING ANALYSIS 

Considering the 'size' of the system as the number of learning trials $\tau$, we proceed now to examine the sharpness of the phase transition at $\gamma_c$ using finite-size scaling [9]. This threshold phenomenon is best appreciated in Fig. 3, which exhibits the dependence of the average learning error on the distance to the critical parameter for different values of $\tau$. As the number of trials $\tau$ increases, the difference between the regimes $\gamma < \gamma_c$ and $\gamma > \gamma_c$ becomes evident. All curves intersect at $\gamma = \gamma_c$, for which the average error is a constant given by Eq. (1).

The key insight is obtained when one considers the average learning error as a function of the reduced variable $(\gamma_c - \gamma) \tau^{1/2}$, as exhibited in Fig. 4. Use of this reduced variable produces the collapse of the data for different $\tau$ into a single scaling function, which depends on the values of $N$ and $C$ only. As illustrated in the figure, the data is fitted very well by the functional form

$$\langle \epsilon \rangle = \frac{1}{2} \mathrm{erfc}\left[ a(N) + b(N, C) (\gamma_c - \gamma) \tau^{1/2} \right], \qquad (5)$$



which has a single fitting parameter, $b(N, C)$. The parameter $a(N)$ is obtained by setting $\gamma = \gamma_c$ and then using the expression for $\langle \epsilon_c \rangle$ given by Eq. (1). The final result is

$$a(N) = \mathrm{erfc}^{-1}\left[ \frac{2(N - 1)}{N} \right], \qquad (6)$$

where $\mathrm{erfc}^{-1}(x)$ stands for the inverse complementary error function. We note that $a(2) = 0$ and $a(N) < 0$ for $N > 2$.

FIG. 3: Average learning error as a function of the distance to the critical noise parameter for $N = 10$ and $C = 2$. The symbols are the simulation results for (top to bottom in the positive ordinate region) $\tau = 1, 10, 100, 200, 400$ and $800$. The lines are guides to the eye.

FIG. 4: Average learning error as a function of the reduced variable $(\gamma_c - \gamma) \tau^{1/2}$ for $N = 10$ and $C = 1$ (○), $2$ (□), $3$ (△) and $5$ (▽). The symbols are the simulation results and the lines are given by the scaling function (5) with the parameter $b$ obtained from the fitting of the data.
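The offset $a(N)$ and the scaling form are easy to evaluate numerically. The sketch below assumes the scaling function is $\frac{1}{2}\,\mathrm{erfc}[a(N) + b\,(\gamma_c - \gamma)\tau^{1/2}]$ with $a(N) = \mathrm{erfc}^{-1}[2(N-1)/N]$; the helper names are hypothetical, the value of `b` must be supplied by the user (it is a fitting parameter), and the inverse complementary error function is obtained from the standard-library normal quantile via $\mathrm{erfc}(x) = 2[1 - \Phi(\sqrt{2}\,x)]$.

```python
import math
from statistics import NormalDist

def erfc_inv(y):
    """Inverse complementary error function for 0 < y < 2.

    Uses erfc(x) = 2 * (1 - Phi(sqrt(2) * x)), Phi = standard normal CDF.
    """
    return -NormalDist().inv_cdf(y / 2.0) / math.sqrt(2.0)

def a(N):
    """Offset of the scaling function, a(N) = erfc^{-1}[2(N - 1)/N], N >= 2."""
    return erfc_inv(2.0 * (N - 1) / N)

def scaling_error(N, C, gamma, tau, b):
    """Scaling form of the average error; b is a user-supplied fit parameter."""
    gamma_c = 1.0 - C / N
    return 0.5 * math.erfc(a(N) + b * (gamma_c - gamma) * math.sqrt(tau))

for N in (2, 10, 100):
    print(N, a(N))  # zero for N = 2, negative for larger N
```

By construction the scaling form returns $(N - 1)/N$ at $\gamma = \gamma_c$ for any $\tau$ and any $b$, which is how $a(N)$ was fixed in the first place.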

FIG. 5: Dependence of the fitting parameter $b$ on the ratio $\gamma_c$ for $N = 2$ (×), $N = 10$ (○), $N = 20$ (□), $N = 30$ (△) and $N = 40$ (▽). The solid line is given by Eq. (8).

FIG. 6: Distribution of stases for $N = 20$, $C = 6$, $\gamma = \gamma_c = 0.7$, and (bottom to top) $\tau = 10^3$, $10^4$ and $10^5$. The slope of the straight line is $-3/2$.

We can get some insight into the fitting parameter $b(N, C)$ by calculating explicitly the average learning error for $N = 2$ and $C = 1$. In the limit $\tau \to \infty$ and $\gamma \to \gamma_c = 1/2$ such that $\tau^{1/2} (\gamma - \gamma_c)$ is finite, we find to leading order

$$\langle \epsilon \rangle = \frac{1}{2} \mathrm{erfc}\left[ \frac{\tau^{1/2} (\gamma_c - \gamma)}{[2 \gamma_c (1 - \gamma_c)]^{1/2}} \right]. \qquad (7)$$



Hence we assume that $b(N, C) = b(\gamma_c)$ and plot this fitting parameter in Fig. 5 for a large selection of values of $N$ and $C$. More pointedly, for each value of $N$ (represented by different symbols in the figure) we vary $C$ from $1$ to $N - 1$ to obtain scaling functions such as those shown in Fig. 4. Then these functions are fitted using Eq. (5) in order to determine the fitting parameter $b$. For $N > 4$ the data is fitted very well by the function

$$b(\gamma_c) = \frac{b'}{[\gamma_c (1 - \gamma_c)]^{1/2}} \qquad (8)$$



with $b' = 0.65$. Note that for $N = 2$ we have $b' = 1/\sqrt{2} \approx 0.71$.
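The $N = 2$, $C = 1$ case can be checked directly, since the difference between the two confidences is then a simple random walk and the average error is a binomial sum. The sketch below compares that exact sum with the leading-order form $\frac{1}{2}\,\mathrm{erfc}\{\tau^{1/2}(\gamma_c - \gamma)/[2\gamma_c(1 - \gamma_c)]^{1/2}\}$; the function names are hypothetical and the particular values of $\tau$ and $\gamma$ are arbitrary illustrations.

```python
import math

def exact_error(tau, gamma):
    """Exact average error for N = 2, C = 1 after tau trials.

    k counts the trials showing the correct object, k ~ Binomial(tau, 1 - gamma).
    The error is 1 if k < tau - k, 1/2 on a tie, and 0 otherwise.
    """
    total = 0.0
    for k in range(tau + 1):
        prob = math.comb(tau, k) * (1 - gamma) ** k * gamma ** (tau - k)
        if 2 * k < tau:
            total += prob
        elif 2 * k == tau:
            total += 0.5 * prob
    return total

def leading_order_error(tau, gamma, gamma_c=0.5):
    """Leading-order error near the critical point (erfc form with b' = 1/sqrt 2)."""
    x = math.sqrt(tau) * (gamma_c - gamma) / math.sqrt(2 * gamma_c * (1 - gamma_c))
    return 0.5 * math.erfc(x)

for gamma in (0.48, 0.50, 0.52):
    print(gamma, exact_error(400, gamma), leading_order_error(400, gamma))
```

The two columns should agree closely for noise levels within a few multiples of $\tau^{-1/2}$ of $\gamma_c = 1/2$, with the agreement degrading further away where the leading-order approximation no longer applies.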

Figure 5 reveals a most interesting symmetry: for fixed $N$ the average learning error, when plotted against the reduced variable $(\gamma_c - \gamma) \tau^{1/2}$, is invariant to the change $C \to N - C$, which implies $\gamma_c \to 1 - \gamma_c$. In particular, in Fig. 4 the results for $C = 9$ are identical to those displayed for $C = 1$, the results for $C = 8$ to those for $C = 2$, and so on. However, we must note that this symmetry is exact only in the limits $\tau \to \infty$ and $\gamma \to \gamma_c$.

For an infinitely large lexicon, $N \to \infty$, we have $a(N) \sim -\ln^{1/2} N$ and so $\langle \epsilon \rangle \to 1$ if the context size $C$ grows linearly with $N$ (i.e., $\gamma_c$ is nonzero), but $\langle \epsilon \rangle \to 0$ if $C$ remains finite since in this case $b \sim N^{1/2}$ diverges faster than $a(N)$.



VI. STATISTICS OF STASIS 

A distinctive feature of the learning process revealed by Fig. 1 is the existence of long periods when the learning error stands at zero value, i.e., $p_{hh} > p_{ih}$ for all objects $i \neq h$. These periods or stases are characterized by repeated additions of credence units to the confidence values and they end when one (or more) of the $N - 1$ confidences $p_{ih}$, $i \neq h$, equals $p_{hh}$.

We begin the analysis of the distribution $P_c(\Delta\tau)$ of the durations $\Delta\tau$ of the stases at the critical parameter $\gamma_c$ by showing in Fig. 6 how the total number of learning trials $\tau_0$ (basically a cutoff time) affects this distribution. The rescaling $\tau_0^{3/2} P_c(\Delta\tau / \tau_0)$ makes the results essentially independent of the cutoff parameter $\tau_0$ provided $\Delta\tau / \tau_0$ is not too small (data not shown). The curves exhibit a clear power-law behavior with exponent $-3/2$, which is the mean-field exponent for the size of avalanches in self-organized critical models [10].

In addition, we find that away from the critical point the distribution $P(\Delta\tau)$ is exponential and that the average duration of the stases diverges like $\langle \Delta\tau \rangle \sim |\gamma_c - \gamma|^{-1}$ as $\gamma \to \gamma_c$.

As expected, these mean-field critical exponents are robust to changes in the model parameters $N$ and $C$. In fact, for $N = 2$ and $C = 1$ the distribution $P(\Delta\tau)$ can be easily calculated analytically for any value of $\gamma$ since this is the classical ruin problem in which a gambler with initial capital $z = 1$ plays against an infinitely rich adversary. The results for the duration of the game $\Delta\tau$ are simply $P_c(\Delta\tau) \approx (2/\pi)^{1/2} (\Delta\tau)^{-3/2}$ and $\langle \Delta\tau \rangle = (1/2) |\gamma_c - \gamma|^{-1}$ (see Chapter XIV of [11]).
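The ruin-problem result at the critical point can be checked against the exact first-passage probabilities of a symmetric walk. The sketch below assumes the standard first-passage formula for a $\pm 1$ walk started at $z = 1$, namely that absorption at 0 occurs at odd step $n$ with probability $\binom{n}{(n+1)/2} / (n\, 2^n)$, and compares it with $(2/\pi)^{1/2} n^{-3/2}$; the function names are hypothetical.

```python
import math

def first_passage(n):
    """Probability that a symmetric +/-1 walk started at 1 first hits 0 at step n.

    Corresponds to N = 2, C = 1 at gamma = 1/2; only odd n are possible.
    """
    if n % 2 == 0:
        return 0.0
    return math.comb(n, (n + 1) // 2) / (n * 2 ** n)

def asymptotic(n):
    """Large-n power law, (2 / pi)^(1/2) * n^(-3/2)."""
    return math.sqrt(2 / math.pi) * n ** -1.5

for n in (11, 101, 1001):
    print(n, first_passage(n) / asymptotic(n))  # ratio approaches 1 for large odd n
```

For example, the walk is absorbed at the first step with probability 1/2 and at the third step with probability 1/8; the power law describes only the tail of the distribution, and only the odd durations that are possible in this two-object case.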

Changes in the number of objects $N$ have no significant influence on $P_c(\Delta\tau)$ whereas changes in the context size $C$ produce a shift in the distribution, without affecting the power-law exponent, as illustrated in Fig. 7. In fact, an increase of $C$ increases the frequency of short stases and, consequently, reduces the frequency of long ones. This is expected since the larger the context size, the greater the number of mismatching objects that have their confidences updated, and so the greater the odds of occurrence of the jump condition $p_{ih} \geq p_{hh}$ for some object $i \neq h$.

Finally, we note that although we have focused on the periods of the learning process when the learning error is 0, the very same conclusions hold for the periods when the learning error is 1.

FIG. 7: Distribution of stases for $N = 20$, $\tau = 10^5$ and (bottom to top at $\Delta\tau = 1$) $C = 1, 2, 5$. The slope of the straight line is $-3/2$.

VII. CONCLUSION 

The view of language as a collective phenomenon arising out of local social interactions has prompted its modeling and investigation through statistical physics concepts and tools [12]. Words have been likened to genes and their evolution studied within a population genetics framework [13, 14], whereas the competition between whole languages has been considered using population dynamics models [15-17]. The study of the bootstrap of a common lexicon among a large population of individuals has revealed a sharp phase transition towards shared conventions [18] as well as an unexpected connection with random occupancy problems in the case only two individuals interact but the lexicon size is very large [19].

The problem of acquiring, rather than bootstrapping, a fixed lexicon from observational learning is relevant to developmental psychology since it allows a quantitative appraisal of the associationist hypothesis on early word learning [2]. In particular, we show that the utterance of out-of-context words may result in severe limitations to learning, depending on the ratio $C/N$ between the number of objects presented to the learner at a learning trial and the total number of objects. If this ratio is small (i.e., $\gamma_c$ is close to 1) then this noisy effect is largely irrelevant and the lexicon can quickly be learned to perfection. However, for large values of this ratio (i.e., $\gamma_c$ is close to 0) learning becomes impossible regardless of the number of trials $\tau$. Finite-size scaling shows that the threshold phenomenon persists across a region of size $\tau^{-1/2}$ around $\gamma_c$ and offers the explicit functional form of the learning error in this region.

The simplicity of our associative learning algorithm allowed us to consider the learning of the distinct words as independent stochastic processes. Interactions between words, such as the mutual exclusivity constraint that instructs children to associate novel words to unnamed objects [2], are well-established in developmental psychology and it would be interesting to see whether and how they alter the characteristics of the critical phenomenon reported here.